The Metropolis algorithm is the simplest Markov chain Monte Carlo (MCMC) algorithm. Even though it is rarely used in practice anymore because it is not very efficient, it is useful for explaining the key components of MCMC algorithms.
The goal of an MCMC algorithm is to generate samples from a posterior distribution, i.e. we want to calculate how probable parameter values are given the data, the likelihood and the prior.
To introduce the Metropolis algorithm, we use a simple example where we try to estimate the mean and standard deviation of these data:
set.seed(1)
y = rnorm(100)
hist(y)
As a first step, we need a function that calculates the (unnormalized) log posterior probability of the parameters given the data, likelihood and priors. Here we use the by now hopefully well known ingredients:
calc.P = function(mu, log_sigma, x) {
return(
sum(dnorm(x, mu, exp(log_sigma), log = TRUE)) + # log likelihood
dnorm(mu, mean = 0, sd = 1, log = TRUE) + # prior for the mean
dnorm(log_sigma, mean = 0, sd = 1, log = TRUE) # prior for log(sd)
)
}
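As a quick sanity check (not part of the original text), the log posterior should be higher near the true values (mu = 0, sigma = 1, i.e. log_sigma = 0) than at the far-away initial guess used below; calc.P is re-defined here so the block runs on its own:

```r
set.seed(1)
y = rnorm(100)

# unnormalized log posterior: log likelihood plus log priors
calc.P = function(mu, log_sigma, x) {
  sum(dnorm(x, mu, exp(log_sigma), log = TRUE)) +
    dnorm(mu, mean = 0, sd = 1, log = TRUE) +
    dnorm(log_sigma, mean = 0, sd = 1, log = TRUE)
}

# compare the log posterior near the truth with the initial guess below
ref = c(near.truth = calc.P(0, 0, y), initial = calc.P(-2, 1.5, y))
ref
```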
As is generally done in Bayesian computation, we do the computations on the log scale, so multiplication becomes addition. (Putting a normal prior on log_sigma and later exponentiating is slightly unusual. The reason this is done here is that I wanted to avoid non-symmetric proposals for the error variance, which has to be positive.)
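To see why the log scale matters numerically, here is a tiny illustration with made-up numbers: the product of many small likelihood values underflows to zero in double precision, while the sum of their logs stays perfectly representable.

```r
# 500 likelihood values of 0.1 each: their product is 1e-500,
# which is below the smallest representable double (~5e-324)
p = rep(0.1, 500)
prod(p)      # underflows to exactly 0
sum(log(p))  # the log of the product is fine: 500 * log(0.1)
```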
To start the sampling process, we set some initial parameter values.
We store the initial values in vectors that will also hold our posterior
samples. The vector P holds the log posteriors.
# number of samples we want to draw
iter = 4000
# vectors to store simulation results
post.mu = vector(length = iter)
post.log_sigma = vector(length = iter)
P = vector(length = iter)
# initialize parameter values (usually done randomly)
post.mu[1] = -2
post.log_sigma[1] = 1.5
# calculate log posterior
P[1] = calc.P(post.mu[1], post.log_sigma[1], y)
Proposals for the Metropolis algorithm have to be symmetric around the current parameter values. Here we use a normal distribution with a standard deviation of 0.05.
This standard deviation is an important tuning parameter: if it is too small and we initialize far away from the posterior distribution, the algorithm will take a long time to find the posterior. On the other hand, if the standard deviation is too large, the algorithm will make lots of proposals that are rejected, which also slows things down.
One rule of thumb is to set the standard deviation such that a good share of the proposals is accepted; classic results for random-walk Metropolis suggest targeting acceptance rates of roughly 25-50%. This is partially a process of trial and error, though it helps to choose reasonable priors and to choose standard deviations for proposals that are consistent with the priors. Luckily, modern MCMC samplers are tuned automatically.
i = 2 # current iteration
proposal.mu = rnorm(1, mean = post.mu[i-1], sd = 0.05)
proposal.log_sigma = rnorm(1, mean = post.log_sigma[i-1], sd = 0.05)
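The normal proposal is symmetric: the density of proposing value b while at a equals the density of proposing a while at b. This symmetry is what allows the plain Metropolis algorithm to use the simple posterior ratio below. A quick numeric check with arbitrary values (a, b and sd.prop are just illustrative names):

```r
a = 0.3; b = -0.7; sd.prop = 0.05

# density of proposing b when the chain is at a, and vice versa
q.ab = dnorm(b, mean = a, sd = sd.prop)
q.ba = dnorm(a, mean = b, sd = sd.prop)

all.equal(q.ab, q.ba)  # TRUE: the normal proposal is symmetric
```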
Because our goal is to describe a posterior distribution, we should preferentially keep samples that have a high log posterior.
On the other hand, we also want to explore the space of possible parameters and should therefore also accept samples that have lower log posteriors. After all, we do not simply want to find the maximum a posteriori estimate, i.e. the parameter combination at which the log posterior is highest; we want to know the (relative) probability of parameter combinations.
The decision about whether the proposal is accepted as the new sample depends on the relative log posteriors of the old sample and the proposal. So we next calculate the log posterior for the proposal.
proposal.P = calc.P(proposal.mu, proposal.log_sigma, y)
The decision rule for accepting the proposal as the next sample is as follows: if the proposal has a higher log posterior than the current sample, we always accept it; otherwise, we accept it with a probability equal to the ratio of the (unlogged) posterior probabilities.
This was the decision rule in words, here it is in code:
First, we want to know the relative posterior probability of the proposal compared to that of the last sample.
c(log_post.last_sample = P[1], log_post.proposal = proposal.P)
## log_post.last_sample log_post.proposal
## -259.9161 -260.4642
Because we are on the log scale, we subtract and then take the exponent:
exp(proposal.P-P[1])
## [1] 0.578066
So the probability of the proposal is 0.58 times the probability of the last sample.
Because the proposal has a lower probability, we accept it only with this probability, using a random draw:
r = runif(1)
accept = r < exp(proposal.P - P[1])
accept
## [1] TRUE
In this case we accept the proposal.
Now that we have decided on this sample, we generate a new proposal for the next one.
Un-hide the next code block to see how to generate 4000 samples.
last.P = proposal.P
for (k in 2:iter) {
# generate proposal and calculate log posterior
proposal.mu = rnorm(1, mean = post.mu[k-1], sd = .05)
proposal.log_sigma = rnorm(1, mean = post.log_sigma[k-1], sd = .05)
proposal.P = calc.P(proposal.mu, proposal.log_sigma, y)
# acceptance probability
acceptance = min(1, exp(proposal.P - last.P))
# accept with probability `acceptance` (this covers both cases:
# if the proposal is better, acceptance = 1 and we always accept)
if (runif(1) < acceptance) {
post.mu[k] = proposal.mu
post.log_sigma[k] = proposal.log_sigma
last.P = proposal.P
} else {
post.mu[k] = post.mu[k-1]
post.log_sigma[k] = post.log_sigma[k-1]
}
}
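To check the acceptance-rate rule of thumb above, one can count how often the stored chain actually moved. A small standalone sketch (acceptance.rate and toy are our illustrative names; in the example above you would apply the function to post.mu). Note it slightly undercounts in the unlikely case a proposal exactly equals the current value:

```r
# fraction of transitions where the chain moved,
# i.e. where the proposal was accepted
acceptance.rate = function(samples) mean(diff(samples) != 0)

# toy chain: 5 transitions, of which 2 are repeats (rejections)
toy = c(1.0, 1.2, 1.2, 0.9, 0.9, 1.1)
acceptance.rate(toy)  # 3 accepted moves out of 5 transitions = 0.6
```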
And here is the result of this process:
Metropolis algorithm visualized. Red dots are rejected proposals
Did we learn the correct mean and standard deviation?
We can plot the posterior distribution, excluding the first 2000 burn-in samples during which the sampler was still finding the posterior distribution.
par(mar=c(3,3,2,.5), mgp=c(1.75,.5,0), tck=-.01)
layout(matrix(c(1,3,1,3,2,4), nrow = 2))
posterior.histogram = function(x, xlab) {
plot(seq_along(x), x, type = 'l', ylab = xlab, xlab = "iteration",
main = paste0("traceplot ", xlab))
hist(x, main = paste0("mean = ", round(mean(x), 2)), xlab = xlab)
}
posterior.histogram(post.mu[-(1:2000)],"mu")
posterior.histogram(exp(post.log_sigma[-(1:2000)]),"sigma")
And to double-check, here are the arithmetic mean and sd:
c(mean = mean(y), sd = sd(y))
## mean sd
## 0.1088874 0.8981994
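The posterior mean of mu is slightly shrunk toward 0 relative to the arithmetic mean, as expected under the Normal(0, 1) prior. Treating sigma as fixed at its sample estimate, the conjugate normal-normal result gives the shrinkage factor (a quick sketch; shrinkage is our name, not from the text above):

```r
set.seed(1)
y = rnorm(100)
n = length(y)
s2 = sd(y)^2  # treat sigma^2 as known, fixed at the sample estimate

# posterior mean of mu under a Normal(0, 1) prior:
# (n/s2 * ybar + 0) / (n/s2 + 1), so the shrinkage factor is
shrinkage = (n / s2) / (n / s2 + 1)
c(arithmetic = mean(y), shrunk = shrinkage * mean(y))
```

With 100 observations the prior carries little weight, so the shrinkage is mild.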
While this was a trivial example, it shows us that MCMC, here in the form of the Metropolis algorithm, can compute posterior distributions.
The Metropolis algorithm and refinements like the Gibbs sampler generally perform well, but they have problems when parameters are correlated.
Take for example these data with high collinearity, akin to the 2-legs example:
set.seed(1)
Sigma = matrix(c(1,.95,.95,1), nrow = 2)
X = MASS::mvrnorm(100,mu = c(0,0), Sigma = Sigma)
Y = X %*% c(1,1) + rnorm(100)
XY = cbind(X1 = X[,1], X2 = X[,2], Y = Y)
colnames(XY)[3] = "Y"
par(mar=c(3,3,2,1), mgp=c(1.25,.5,0), tck=-.01)
pairs(XY)
If we estimate a model \(\small Y \sim Normal(\beta_1 X_1 \ + \beta_2 X_2, \sigma)\) the coefficients \(\small \beta_1\) and \(\small \beta_2\) are highly correlated.
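We can verify this coefficient correlation directly from an ordinary least-squares fit of the same simulated data: cov2cor applied to the coefficient covariance matrix shows a strong negative correlation between the two slope estimates (a quick check, reusing the simulation above with Y as a plain vector):

```r
set.seed(1)
Sigma = matrix(c(1, .95, .95, 1), nrow = 2)
X = MASS::mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
Y = drop(X %*% c(1, 1) + rnorm(100))

fit = lm(Y ~ X[, 1] + X[, 2])

# correlation matrix of the coefficient estimates;
# the off-diagonal entry for the two slopes is strongly negative
round(cov2cor(vcov(fit)), 2)
```

Because X1 and X2 carry nearly the same information, the data can hardly distinguish their coefficients: a larger b1 can be compensated by a smaller b2, which is exactly the ridge-shaped posterior the sampler has to explore.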
Let's see what the Metropolis sampler does with this:
Metropolis sampling for hard problem
We can observe that the sampler moves along the long, narrow ridge of the correlated posterior only slowly: proposals across the ridge are mostly rejected, so the chain needs many iterations to cover the distribution.
This is just one example where Metropolis or Gibbs samplers can have difficulties.
One sampling algorithm that does much better than Metropolis is Hamiltonian Monte Carlo (HMC). On an intuitive level, the key difference in favor of Hamiltonian Monte Carlo is that it generates proposals less randomly than either Metropolis or Gibbs sampling. Here, less randomly means that the direction from the current sample to the proposal is not random; proposals are more likely in the direction of the bulk of the posterior distribution. See here for animations of different model estimation algorithms.
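To make "less random proposals" concrete, here is a minimal sketch of the leapfrog dynamics behind a single HMC proposal for a one-dimensional standard normal posterior. The names grad.log.p, leapfrog.proposal, eps and n.steps are our own illustrative choices, and the final momentum half-step and Metropolis accept/reject correction of real HMC are omitted for brevity:

```r
# gradient of the log density of a standard normal: d/dtheta log p = -theta
grad.log.p = function(theta) -theta

# one HMC proposal: draw a random momentum, then follow the simulated
# Hamiltonian dynamics for n.steps leapfrog steps of size eps
leapfrog.proposal = function(theta, eps = 0.1, n.steps = 20) {
  m = rnorm(1)                          # random initial momentum
  m = m + eps / 2 * grad.log.p(theta)   # half step for the momentum
  for (s in 1:n.steps) {
    theta = theta + eps * m             # full step for the position
    if (s < n.steps) m = m + eps * grad.log.p(theta)
  }
  theta
}

set.seed(1)
prop = leapfrog.proposal(theta = -2)  # starting in the tail at -2
prop  # the proposal is pulled toward the bulk of the posterior near 0
```

Only the direction of the initial momentum is random; after that, the gradient steers the trajectory toward regions of high posterior probability, which is why HMC copes much better with correlated parameters.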
To clarify the relationship of different terms:

- ulam is a function in the rethinking package that translates rethinking models (alist(...)) into the Stan language and runs the HMC sampler.
- brms and rstanarm let one formulate models in the typical R formula style (e.g. fit = brm(y ~ x1 + x2, data = dt)).
- rstan and cmdstanr require that the user writes the model directly in the Stan programming language.

We first put the data for the model together:
data.list = list(
Y = as.vector(Y),
X1 = X[,1],
X2 = X[,2]
)
Note that, unlike for quap models, we need to do all transformations before we submit the data to ulam. It is OK to submit the data as a data.frame or as a list. The latter is more flexible, which is why we use it.
Next we define the rethinking model:
model = alist(
Y ~ normal(mu,exp(log_sigma)),
mu <- a + b1*X1 + b2*X2,
a ~ dnorm(0,1),
b1 ~ dnorm(0,1),
b2 ~ dnorm(0,1),
log_sigma ~ dnorm(0,1)
)
This is exactly the same model that we also fit with the Metropolis sampler above.
Until here, everything was as we know it from quap models. But now we use the ulam function, which requires a few additional parameters because we are now actually running simulations.
u.fit = ulam(
model,
data = data.list,
iter = 2000, # 2000 iterations, (1000 warmup)
chains = 4, # four chains, to check convergence
cores = 4, # use 4 cores in parallel
cmdstan = TRUE) # use cmdstanr not rstan
Here is how HMC (via ulam and Stan) explores the posterior distribution:
Maybe the comparison to the Metropolis sampler is not so clear, so let's look at both together:
Metropolis vs HMC
How do we know that the results from ulam are better than the results from the Metropolis sampling scheme? One way is to inspect trace plots and rank histograms of multiple chains.
First an example: this is an ensemble of 4 healthy chains:
trace.rankplot = function(chains) {
par(mfrow = c(2,1), mar = c(2,3,0.1,.5), mgp = c(1.75,.5,0), tck = -.01)
matplot(chains, type = "l", lty = 1, ylab = "theta", xlab = "")
ranks = matrix(rank(chains), ncol = ncol(chains))
h = hist(ranks, plot = FALSE, breaks = 30)
rank.hists =
apply(ranks, 2,
function(x) hist(x, breaks = h$breaks, plot = FALSE)$counts)
par(mar = c(3,3,0,.5))
matplot(h$mids, rank.hists, type = "s", ylim = c(0, max(rank.hists)), lty = 1,
xlab = "rank", ylab = "frequency")
}
chains.well.mixing = matrix(rnorm(4000),ncol = 4)
trace.rankplot(chains.well.mixing)
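For contrast, consider a hypothetical unhealthy ensemble (our own construction, not from the text): if one chain is stuck in a different region of parameter space, its samples hoard the highest ranks instead of being spread uniformly. Passing such a matrix to trace.rankplot above shows the effect graphically; here we just check the rank means:

```r
set.seed(2)
# three chains sample the target correctly, a fourth is stuck around 3
healthy = matrix(rnorm(3000), ncol = 3)
stuck   = rnorm(1000, mean = 3)
chains.stuck = cbind(healthy, stuck)

ranks = matrix(rank(chains.stuck), ncol = 4)
# with good mixing every chain's mean rank would be ~ (4000 + 1) / 2 = 2000.5
round(colMeans(ranks))  # the stuck chain's mean rank is far above the rest
```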